Removing the bad apples: A simple bioinformatic method to improve loci-recovery in de novo RADseq data for non-model organisms
Abstract
The establishment of high-throughput sequencing, together with bioinformatic processing tools (the genomics revolution), has impacted biology during the last decade by resolving long-standing questions in phylogenetics (Abalde et al., 2019; Rochette 2014; Struck 2011) and in speciation and adaptation (Birkeland 2020; Ravinet 2017, 2018; Weber 2019), and by opening new avenues of research such as genome-wide structural variants (Catchen Faria 2019). While genome-level data have become widely accessible, population-level (i.e. population genomics) and species-level (i.e. phylogenomics) inference remains challenging due to the limited number of high-quality genomes and the costs associated with sequencing and analysing large datasets. These challenges encouraged the development of reduced-representation sequencing (RRS), where genomic complexity is reduced by sequencing only a portion of the genome. Chief among RRS is the ‘Restriction site-Associated DNA Sequencing’ (RADseq; Baird 2008; Davey) family of techniques, which involve digesting DNA using type-II restriction enzymes and sequencing the regions flanking the cut site. Benefiting from the distribution of restriction sites over the genome, RADseq-based approaches are cost and time efficient, typically providing thousands of independent loci for inference. For instance, a 2019 estimate suggests that, for the price of a single whole genome resequenced for the three-spined stickleback Gasterosteus aculeatus, >100 individuals may be sequenced at similar depth using RADseq, which would cover ~3% of the genome. Since RADseq approaches rely on the existence of restriction sites along the genome, conservation of the restriction site is of critical importance for recovering loci shared between different individuals (Eaton 2017; Huang & Lacey Knowles, 2016; O'Leary 2018).
Allele dropout occurs when given locus or allele not one more individuals, it result biological divergence—when mutation modifies Rates thereby expected correlated divergence between lineages (Crotti Eaton However, also artefacts experimental design, sampling bias low sequence coverage; problems library preparation, issues enzyme digestion size selection, human error; extraction since, some organismal groups, extracting still non-trivial their presence chemical compounds interfere extraction; analyses, clustering reads In fact, originating these technical can sometimes exceed origin under certain conditions (Rivera-Colón 2020). Whatever case be, high translates rates missing dataset, dramatically influence frequency dataset (Arnold 2013; Gautier Hodel 2017), phylogenetic reconstruction 2017). Attempts minimize posed include suggestions parameter optimization control (Paris Catchen, data-filtering exploration (O'Leary 2018), data-cleaning thresholds prospective retrospective simulation based reference available Despite these, retrieving an optimal RADseq experiment pose challenges. On hand, expertise involving selection PCR rounds (broadly preparation) scarce biologists working non-model systems. other post-sequencing filter lead pruning informative (Huang Lee retention particular characteristics datasets varying levels species (Dinc? Here, we suggest simple method mitigate allelic mixed biases design data. Simply put, subgroups, level (Figure 1), users will better distinguish sources dropout. most RAD studies comprised metapopulation, composed several subpopulations, analyses focus optimizing parameters across metapopulation whole. Instead, here directly each subpopulation subspecies. Using this approach four datasets, identified degree (hereafter bad apples) removed them final analysis comprising all populations 1). To test suggested pipeline, used including: (a) ddRADseq meiofaunal annelid Stygocapitella zecae (J. Cerca al, unpubl. 
data; 21 samples, six populations); (b) single-digest Euhadra molluscs (Richards 2017); 16 species); (c) Antarctic sponge Dendrilla antarctica (Leiva 62 seven (d) hyRADseq Anthochaera phrygia combining museum modern samples (Crates 230 eight populations). A graphical overview provided Figure 1. We opted subgroup population-by-population species-by-species) pipeline because removal populations—especially those divergent. By running subgroup-level explore level, operating scale derive molecular sequencing. processed de novo implemented Stacks v2.41 (Rochette began -M (number mismatches allowed stacks within individuals) -n following Paris (2017)'s optimize Stacks. Essentially, involves combinations determining obtained, choosing ‘right’ space - i.e. avoiding over-splitting over-merging selected 2 every exception phrygia, 3 3. parameters, ran wrapper generating complete unclean datasets; 1, steps 1–2). After obtaining individually Stygocapitella, step 3), separately (using default applying Anthochaera, was used; Tables S1–S4), generated variant call format (VCF) file each. obtained information individual vcftools (Danecek 2011) (--missing-indv option; S1–S4; 4). coverage common problem genomic-level studies, retrieved (--depth S1–S4). With mind, labelled keep remove (bad apples), general strategy included: retaining minimum two per species; designating threshold data, average (?40% ?30% Euhadra, ?65% Dendrilla, ?40% Anthochaera). Notice kept since they valuable historical specimens 5). done recreate typical ancient-DNA study. Additionally, there very range -r 0.2 (minimum percentage required process population) dataset. identification apples, three datasets: clean, hybrid-clean random. clean (that specimen apple) involved rerunning (starting ustacks). included same (all specimens) but reusing output represented 6. assembles first u/c/s/gstacks, then filtered module. Therefore, run behind includes apples excluded filtering step. 
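The labelling step described above (per-individual missingness from vcftools `--missing-indv`, a dataset-specific missing-data threshold, and a minimum of two retained samples per group) might be sketched as follows. This is an illustrative Python sketch, not the authors' code: the function name `flag_bad_apples`, the `popmap` dictionary, and the choice to treat values strictly above the threshold as bad apples are assumptions made here. The column names `INDV` and `F_MISS` follow the `.imiss` file that vcftools writes.

```python
import csv
import io
from collections import defaultdict


def flag_bad_apples(imiss_text, popmap, threshold, min_keep=2):
    """Label each individual 'keep' or 'remove' from vcftools .imiss content.

    imiss_text: tab-separated text with a header containing INDV and F_MISS.
    popmap:     hypothetical dict mapping individual ID -> population/species.
    threshold:  assumed cut-off; F_MISS strictly above it marks a bad apple.
    min_keep:   the least-missing `min_keep` samples per group are always kept.
    """
    reader = csv.DictReader(io.StringIO(imiss_text), delimiter="\t")
    by_group = defaultdict(list)
    for row in reader:
        by_group[popmap[row["INDV"]]].append((float(row["F_MISS"]), row["INDV"]))

    labels = {}
    for members in by_group.values():
        members.sort()  # lowest missingness first
        for rank, (f_miss, indv) in enumerate(members):
            keep = rank < min_keep or f_miss <= threshold
            labels[indv] = "keep" if keep else "remove"
    return labels
```

Because the threshold is applied within each subgroup, a population that sequenced poorly overall still retains its two best samples instead of being dropped entirely.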
Finally, understand overall impact removing specimens, performed 10 random runs, detected; however, were haphazardly. aim assess effect determine differences unclean, options, (regardless SNPs locus) thresholds, % (Table S5) explored whether kept/removed loci. Specifically, obtain loci, fixed -p must present locus; 4 zecae, spp., phrygia) -R 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% 100% locus). constant varied relies varies remaining estimates vcftools, reported above apples. both determined relative difference evaluated (clean, random) value words, first, dataset; second, divided percentages multiplying 100. restricted least 100 small denominators generate values even facing changes. example, 1 generates ones. terms, increase but, practical does translate improvement. Moreover, do numbers usually consider hundreds stages up 0.6 considered 1.0 0.5 S5). affected classes comparing share set assembled (catalogue Stacks), diverge stage. so, converted present/absence plotted (e) concern genetically distinct may, various reasons, occur nature. this, principal component (PCA) labelling specimens. tested deviations nucleotide diversity (?) Watterson's estimator (?) uncleaned, hybrid-cleaned cleaned PCA chosen its calculation assumes locus. As result, middle PC axis. Deviation ? ? should provide further evidence individuals. carry out PCA, --write-random-SNP option while (vcf) populations, so linkage disequilibrium removed. vcf loaded R package vcfR (Knaus Grünwald, carried functions adegenet (Jombart Ahmed, 2011). calculations, fasta sequences populations. custom perl unix scripts, split into turn, according Locus1_populationA; Locus 1_populationB; … N_population_X). From selecting nine including five specimens). DNAsp v6 (Rozas calculated loci-by-loci ?, well averages population. There wide variation terms (SNPs) sample referred simply data) (Tables Nonetheless, correlate. 
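The relative-difference comparison described in the methods above (subtract the unclean count from the evaluated dataset's count, divide by the unclean count, multiply by 100, and skip comparisons with fewer than 100 loci) could be written as the minimal Python sketch below. The function name `relative_difference` and the decision to apply the 100-locus minimum to the unclean count, which is the denominator, are assumptions for illustration.

```python
def relative_difference(n_evaluated, n_unclean, min_loci=100):
    """Percentage change in loci relative to the unclean dataset.

    Returns None when the unclean dataset has fewer than `min_loci` loci,
    since small denominators inflate the percentage even for tiny changes.
    """
    if n_unclean < min_loci:
        return None
    return (n_evaluated - n_unclean) / n_unclean * 100.0


# Example: evaluate hypothetical locus counts at several -R settings.
unclean = {0.0: 5000, 0.5: 1200, 1.0: 40}
cleaned = {0.0: 5200, 0.5: 1800, 1.0: 90}
changes = {r: relative_difference(cleaned[r], unclean[r]) for r in unclean}
```

In this made-up example the `-R 1.0` entry is skipped because its unclean count falls below the minimum, which is the situation the restriction is meant to guard against.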
extreme Ardtoe has total 37,028 SNPs, end Lødingen 2,821 1; SNP differences, missingness 53% 55% respectively. two, Henningsvær (53% data), higher lowest 22% found Cutty Sark. According established protocol label 97% 71% 60% 94%. No Sark S1). Kristineberg Musselburgh, aforementioned ranges, 68% respectively total, 29% removed; Table This improved except Sark, no strongest change could observed population, decreased 28% missingness. Importantly, strictly correlate coverage. Kristineberg, highest (80× 75×) apple. Of When separately, aomoriensis missingness, 33,866 E. quaesita yielded (40,493) 23% senckenbergiana had 19,126 10% aomoriensis, 44% 34% quaesita, 30% (threshold 29%; S2). (12%, S2), wanted 15 (27% steepest decrease correlation (and therefore degrees 39% Den_PAR 77% Den_FIL applied (‘-r’ flag, ‘-r 0.2’ Den_OH left. 2,086 636) applied. Den_ROT 6,317 without (decreasing 3,299 after filtering). before less substantial Den_CIE population; specific, 5,155 4,389 64% S3). was, cases, quite rigorous Den_CIE, many 31 (50% Due substantially 70% 41%. Similar strict multiple having ranging 76 QLD 2,656 NA extent comparable (50%), NNSW NVIC 33% S4). namely 41 (18% S4), led retained substantial. SVIC 40% 31% instead agreement correspond general, cases much 300% than 2). Notably, increases decreasing cannot merely attributed smaller discussed above, yield hybrid-cleaned, despite confirms negative effects noticeable pronounced Euhadra. relatively, best analysed lower contrast excellent easily indicates room improvement performed. Another interesting observation restrictive settings (-R 0, 0.1), greatest Considering three, seems good dataset) reaches clear, effect. Except obvious retrieve nearly values. perform differently values, performing slightly settings. performs hybrid-clean. respect exceptions (0%) 3). clearly worse results exclusion, exclusion hybrid-cleaned. settings, exclusion. 
explained being presented S5), little cleaning approaches, hybrid-clean, performance appears -R, loci) outperforms always improves three. increased majority relationship measurements negative. Hence, vice versa. generally different. improve, them. contrast, positive both, increases. different, clear pattern observed. Removal appear against class caused evident loci: 3,875; 321 22,753; 21,140; 77 4,967. suggests driven inherent stochasticity presence/absence rather systemic did outliers pulled algorithm perceptible represent central positions PCs few extremities removed, illustrated DEN_OH, DEN_PAR (populations Dendrilla), ACT, ADL Anthochaera; subgroup-runs Changes S6). similarly comparisons. considering together, increasing and/or attributable randomly, comparison approaches. Namely, performances, (Figures 2-4). seem predictable, consistent depended investigated. replicate resource consuming studied threefold 2), 18%–50% retrieval thoroughly, thus ‘high-quality’ collection consisting highly diverged benefits, conducted carefully. First, principle requires carefully part design. determination priori ‘population maps’, lack precision, inclusion minor/deviant genetic background. particularly difficult for, marine limits (Cerca Hellberg, 2009), morphologically (cryptic species) overlooked (Struck Cerca, 2018) potentially minority landscape studies. Second, interest. hybridization incomplete lineage sorting contributes shifts frequencies (Sætre Ravinet, If divergent alleles admixed wrongly pruned out. Third, precious, A. if yielding rate practice, need identified, targeted occupying intermediate ordinates. note occurred, non-intermediate concerns applicable recommend researchers analyse either through simulation-based assessments 2020) decompose components, guaranteed Current strategies improving laboratory practices bioinformatics, however work case. quantities desirable, achieve taxa. amplification concentration. 
powerful microscopic eukaryotes, nonetheless, introduce (de Medeiros Farrell, >100-year-old included, fragmented low-concentration DNA. libraries and, therefore, needed. downstream below identifying variance read Yet, case, coverage, strands over-represented duplication, translating second biased properties. Thus, methods filters proposed allows distinguishing dropout, benefits genetics stem (mutation site; 2016) preparation (population-by-population species-by-species), able aside isolating stemming targets poorly prepared samples. Lowering significant estimation statistics. (expressed inflated FST heterozygosity, deflated FIS, comparisons simulated empirical Inflation metrics because, sizes, intra-population tend higher. mitigation priority designing projects. works report slight compared exact drivers changes, likely conclude correct ?. Phylogenetic benefit Best too stringent permissive 2019) as, (Lee conservative exclude fast-evolving jeopardizing resolution terminal branches jeopardize signal noise ratio blurred An important our differ any way, suggesting recovery reduction building matrix allow inferences. optimized currently Stacks, expect pipelines focusing analysis. expectation pipelines, ipyrad dDocent Overcast, Puritz 2014), framework files (e.g. genotypes, vcf), implementing results, easy implement authors thank Ana Riesgo, Carlos Leiva Martinez, Sergio Taboada, George Olah, Angus Davidson Ross Crates kindly grateful Emma Falkeid Eriksen illustrations provided. J.Ce. dedicates paper vision, dedication hard who Forbio—the Norwegian Research School Biosystematics. Without workshops, courses financial support, been possible. fight climate justice. partially supported Godfrey Hewitt mobility award European Society Evolutionary Biology (ESEB), UiO:Life Sciences Internationalization fund—two initiates permitted visit J.Ca., N.C., A.R.-C. N.R. Urbana Champaign. Visiting UIUC greatly benefited J.Ce.'s skills. anonymous reviewers comments manuscript. 
NHM Genomics contribution 23. J.Ce., T.H.S. designed study; M.F.M., advice infrastructure A.R.-C., J.Ca. drafted All approved draft. peer review history article https://publons.com/publon/10.1111/2041-210X.13562. downloaded public repositories. Richards (2017); (2019); (2019). made Nucleotide Archive (ENA) project id PRJEB40223. Specimen-id cross-matched S1 work, name ‘Submitted FTP’ ENA. Please note: publisher responsible content functionality supporting supplied authors. Any queries (other content) directed corresponding author article.
Journal
Journal title: Methods in Ecology and Evolution
Year: 2021
ISSN: 2041-210X
DOI: https://doi.org/10.1111/2041-210x.13562